ABSTRACT

Item response theory models arose from the inherent limitations of classical test 
theory methods of test analysis. A brief description of those limitations and the 
corresponding enhancements offered by item response models is provided. Further, an 
examination of the popular Rasch one-parameter latent trait model is undertaken. 
Specific explanation of the step-by-step calculations in the one-parameter model is 
accomplished using a commonly available spreadsheet. This paper is designed to be used 
as a teaching heuristic to assist students in understanding both the mechanics and the 
rationale behind item response theory (IRT) measurement. 



The author would like to express his thanks to Bruce Thompson for his comments on an 
earlier draft of this paper. 

When they were first introduced, item response theory (IRT)/latent trait 
measurement models were heralded as “one of the most important methodological 
advances in psychological measurement in the past half century” (McKinley and Mills 
1989, p. 71). However, the pluses and minuses of these models have been hotly debated 
(cf. Lawson 1991) despite their widespread use in various applications such as test 
equating, item selection and adaptive testing. 

This paper will begin with a brief examination of classical test theory and some of 
its inherent weaknesses. An encompassing examination of classical test theory is beyond 
the scope of the paper. Readers desiring greater discourse on the subject are directed to 
Crocker and Algina (1986) and Nunnally and Bernstein (1994). The focus then shifts to 
item response theory with discussion centering on the theoretical framework of the Rasch 
one-parameter IRT model. The concepts underlying and the basic tenets of item response 
theory are explored. Finally, step-by-step calculations involved in the Rasch model will 
be explained using a commonly available spreadsheet. Such spreadsheets can be valuable 
heuristic devices to assist students in truly understanding what is occurring in IRT 
measurement. 



Classical Test Theory 

Classical test theory (CTT) and its related methods have a number of limitations. 
For example, comparison across examinees is limited to situations where the subjects of 
interest are administered the same (parallel) test items. Also, a false presumption of CTT 
is that the variance of errors of measurement is the same for all examinees. Reality 
dictates that some people perform tasks more consistently than others and that 
consistency varies with ability (Hambleton and Swaminathan 1985). 

Two major CTT limitations of note are: 

1. Examinee characteristics cannot be separated from test characteristics, and 

2. CTT is test-oriented rather than item-oriented. 

The first limitation above can be summarized as a situation of circular 
dependency. The examinee statistic (i.e., observed score) is item sample-dependent while 
the item statistics (i.e., item difficulty, item discrimination) are examinee 
sample-dependent. Stated simply, when the test is ‘difficult’, examinees will appear to have lower 
ability and when the test is ‘easy’, they will appear to have higher ability. Likewise, the 
‘difficulty’ of a test item is determined by the proportion of examinees who answer it 
correctly and is thus dependent on the abilities of the examinees being measured 
(Hambleton, Swaminathan and Rogers 1991). This circular dependency poses some 
theoretical difficulties in CTT’s application in measurement situations such as test 
equating and computerized adaptive testing. 
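
The sample dependence of the classical item difficulty can be illustrated with a minimal sketch. The two response vectors below are hypothetical (they are not taken from the data set used later in this paper), and `p_value` is simply a name chosen here for the proportion-correct index.

```python
def p_value(responses):
    """Classical item difficulty: proportion of examinees answering correctly."""
    return sum(responses) / len(responses)

# Hypothetical responses (1 = correct, 0 = incorrect) to the SAME item
# from two made-up groups of ten examinees each.
high_ability_group = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
low_ability_group = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

p_high = p_value(high_ability_group)  # 0.8: the item looks easy
p_low = p_value(low_ability_group)    # 0.3: the same item looks hard

print(p_high, p_low)
```

The item itself has not changed between the two calls; only the group answering it has, which is the circular dependency described above.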

The second major limitation listed is a question of orientation. The CTT model 
fails to allow us to predict how an examinee, given a stated ability level, is likely to 
respond to a particular item (Hambleton et al. 1991). Predicting how an individual 
examinee or a group of examinees will perform on a specific item is quite relevant to a 
number of testing applications. Consider the difficulties facing a test designer who wishes 
to predict test scores across multiple groups, or to design an equitable test for a particular 
group, or possibly to compare examinees who take either different tests or the same test at 
differing times. Such inherent limitations of CTT led psychometricians to develop models 
that not only overcame these limitations but also led to improved bias detection, enhanced 
reliability assessment and increased precision in ability measurement. Item response 
theory (Hambleton and Swaminathan 1985; Hambleton et al. 1991; Lord 1980) provides 
us with a framework to accomplish these desired features. 

Item Response Theory Concept 

Item response theory (IRT) arose out of a psychometric need to overcome the 
limitations of classical test theory and to provide test designers with improved and more 
accurate testing tools. Again, a thorough discussion of IRT is beyond the bounds of the 
present study. Interested readers are directed to Hambleton et al. (1991) and Wright and 
Stone (1979). IRT primarily rests upon two basic postulates (Hambleton et al. 1991; 
Hambleton and Swaminathan 1985): 

1. The performance of an examinee on a test item can be explained (or predicted) 
   by a set of factors called traits, latent traits or abilities; and 

2. The relationship between examinees’ item performance and the trait(s) 
   underlying item performance can be described by a monotonically 
   increasing function called the item characteristic curve (ICC). 

Several IRT models exist, including the three-parameter, two-parameter and one-parameter 
models. The one-parameter model, often referred to as the Rasch model, is the 
most commonly used example and will be the focus of this paper. The models differ 
principally in the mathematical form of the ICC and/or the number of parameters 
specified in the model. 
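
For reference, the one-parameter ICC is commonly written as the logistic function below. The notation is an assumption made here to match the symbols used later in this paper: theta for the person's ability and d_i for the difficulty of item i, both in logits. Some presentations also include a scaling constant, which is omitted in this sketch.

```latex
P_i(\theta) = \frac{e^{\theta - d_i}}{1 + e^{\theta - d_i}}
```

When ability equals difficulty the exponent is zero and the probability of a correct response is 0.50; the curve rises monotonically as ability increases.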
When an IRT model fits the test data of interest, many of the limitations of CTT 
are resolved. For example, examinee latent trait estimates are theoretically no longer test- 
dependent and item indices are no longer group-dependent. Ability estimates derived 
from different groupings of items will be the same, barring measurement error, and item 
parameter estimates derived from different groups of examinees will also be the same, 
barring sampling error (Hambleton et al. 1991). 



Assumptions of Item Response Models 

Unidimensionality and local independence are two assumptions that are 
fundamental to IRT. The unidimensionality assumption requires that only one ability or 
latent trait is measured by the various items that make up the test. Intuitively, this 
assumption cannot be strictly satisfied due to the reality that multiple factors will 
normally impact the test taking performance by an examinee. Exogenous factors such as 
generic cognitive ability, test anxiety and motivation level are likely to impact test 
performance as well. In order for a set of test data to satisfy the assumption of 
unidimensionality, a ‘dominant’ factor influencing performance must be present 
(Hambleton et al. 1991). This dominant factor is referred to as the ability or latent trait 
measured by the test. 

The assumption of local independence requires that an examinee’s responses to 
the various items in a test are statistically independent of each other (Hambleton and 
Swaminathan 1985). This implies that an examinee’s response to any one item will not 
affect their response to any other item in the test. Simply put, the trait specified in the 
model is the only factor influencing the respondent’s answer to the test items (Hambleton 
et al. 1991) and one item does not hold clues for subsequent items. It is important to note 
that the assumption of local independence does not imply that the test items are 
uncorrelated across the total group of examinees (Lord and Novick 1968). Whenever there 
is variation among the examinees on the measured ability, positive correlations between 
pairs of items will result. However, item scores are uncorrelated at a fixed ability level 
(Hambleton and Swaminathan 1985). 
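
The distinction between independence at a fixed ability and correlation across a group can be checked numerically. The sketch below assumes the one-parameter logistic form for the response probability and uses two hypothetical items and a hypothetical group made up of two equally likely ability levels; none of these values come from the paper's data.

```python
import math

def rasch_p(theta, d):
    """Probability of a correct response under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - d)))

def item_covariance(thetas, d1, d2):
    """Covariance of two item scores over equally weighted ability levels,
    assuming the responses are independent WITHIN each ability level."""
    n = len(thetas)
    p1 = sum(rasch_p(t, d1) for t in thetas) / n
    p2 = sum(rasch_p(t, d2) for t in thetas) / n
    p_both = sum(rasch_p(t, d1) * rasch_p(t, d2) for t in thetas) / n
    return p_both - p1 * p2

# Hypothetical item difficulties in logits.
d_easy, d_hard = -0.5, 0.5

cov_fixed = item_covariance([0.0], d_easy, d_hard)        # one ability level
cov_mixed = item_covariance([-1.0, 1.0], d_easy, d_hard)  # abilities vary

print(cov_fixed, cov_mixed)
```

With a single ability level the covariance is zero, while mixing two ability levels produces a positive covariance, which is the pattern described in the paragraph above.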

There are three primary advantages to using item response models (Hambleton 
and Swaminathan 1985): 

1. Assuming the existence of a large pool of items each measuring the same latent 
   trait, the estimate of an examinee’s ability is independent of the particular 
   sample of test items that are administered to the examinee; 

2. Assuming the existence of a large population of examinees, the descriptors of a 
   test item (e.g., item difficulty, item discrimination) are independent of the 
   particular sample of examinees drawn for the purpose of item calibration; 
   and 

3. A statistic indicating the precision with which each examinee’s ability is 
   estimated is provided. 

Thus, the primary argument for employing IRT methods is that the resulting analyses are 
both person-free and sample-free measurements (McKinley and Mills 1989). It should be 
noted that not all researchers agree that IRT offers us such rich benefits. Lawson (1991) 
subjected three test data sets to both classical and Rasch procedures and found 
“remarkable similarities” between the results. Findings for both examinee abilities and 
item difficulties yielded “almost identical information.” Given the mathematical 
intricacies of IRT that are not required of classical methods, Lawson questioned the 
necessity of the Rasch procedure. That is, once misfitting items and people are removed 
from the analysis, IRT and CTT models seem to yield highly correlated person ability and 
item difficulty estimates. The Rasch model continues, however, to be utilized by 
psychometricians. The recent rise in adaptive testing bears testament to the continued use 
of IRT. 



THE RASCH MODEL CALCULATIONS 

Having noted the basic deficiencies of classical test theory and the improvements 
that the more theoretically-based item response theory provides us, attention is now 
focused on the step-by-step calculations in the one-parameter IRT measurement. The 
Rasch calculations can appear daunting to many students. While extremely powerful in 
its applications, the fundamentals of IRT are actually quite straightforward and should not 
be viewed as a black box process. It is hoped that the following discussion will facilitate 
the conceptual grasp of the subject. 

In the following data example, presume that 35 people were tested on an 18 item 
exam. Since the object of the item response model is to predict performance based on 
item calibrations that are independent of the persons generating the data (i.e., person free) 
and examinee ability estimates that are independent of the items used in the measurement 
(i.e., item free), all items that are answered either correctly or incorrectly by everyone will 
be removed from further analysis. Likewise, any person who answered either 0% or 100% 
of the items correctly will also be removed since neither can be calibrated against the 
group and thus provide us with no usable information. That is, such items and people 
provide no information to facilitate the estimation process (e.g., the person with all of the 
items correct may be exactly able enough to do that, or may have any of the infinitely many 
ability levels above the one that is just sufficient to yield this perfect score). The 
resulting data set after this initial cut of the information can be seen in Table 1. 

Insert Table 1 about here 

Table 1 is laid out so that examinees are sorted in increasing order of number of 
items answered correctly while items are sorted in increasing order of number of 
examinees that correctly answered the item. Since responses were dichotomously scored 
as either right or wrong, a ‘0’ in the table denotes an incorrect answer while a ‘1’ denotes 
a correct response. Looking at Table 1, examinee 25 answered the fewest number of 
questions correctly (2) while examinees 24, 34 and 7 answered the greatest number of 
items correctly (11). Remember that any examinee with a perfect score (100%) or a score of 
zero has been removed. Any ‘perfect’ items have also been removed. In this data set, items 
1, 2, 3 and 18 were removed, as was examinee 35. This editing of the 
data continues in this manner until no ‘perfect’ items or persons remain. 
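
The repeated editing described above can be sketched as a small loop. This is an illustrative sketch only: the function name `trim_extremes`, the nested-dictionary layout and the four-person example are assumptions made here, not the spreadsheet layout or the data of Table 1.

```python
def trim_extremes(responses):
    """Repeatedly drop persons and items whose responses are all 0 or all 1.

    responses: {person_id: {item_id: 0 or 1}}; a trimmed copy is returned.
    """
    data = {person: dict(items) for person, items in responses.items()}
    changed = True
    while changed and data:
        changed = False
        # Remove persons with a zero or perfect score.
        for person in list(data):
            scores = list(data[person].values())
            if sum(scores) in (0, len(scores)):
                del data[person]
                changed = True
        if not data:
            break
        # Remove items that every remaining person answered the same way.
        for item in list(next(iter(data.values()))):
            column = [data[person][item] for person in data]
            if sum(column) in (0, len(column)):
                for person in data:
                    del data[person][item]
                changed = True
    return data

# Hypothetical data: person C has a perfect score, and once C is removed
# nothing else changes for i2-i4, but i1 was answered correctly by everyone.
raw = {
    "A": {"i1": 1, "i2": 1, "i3": 0, "i4": 1},
    "B": {"i1": 1, "i2": 0, "i3": 0, "i4": 1},
    "C": {"i1": 1, "i2": 1, "i3": 1, "i4": 1},
    "D": {"i1": 1, "i2": 0, "i3": 1, "i4": 0},
}
trimmed = trim_extremes(raw)
print(sorted(trimmed), sorted(trimmed["A"]))
```

The loop runs again after every removal because dropping an item can leave a person with a zero or perfect score on the items that remain, and vice versa.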

Given this initial editing of the data, the next step in the process is to calibrate 
both the item difficulties and the person abilities. In order for us to make valid 
assessments and predictions arising from the Rasch model, both of these statistics 
(difficulties and abilities) must be linear and in the same metric. In IRT, this is 
accomplished by converting the values into logits. The logit for an item’s difficulty is 
calculated as the natural log of the proportion of examinees who answered the item 
incorrectly divided by the proportion who answered it correctly. Conversely, the logit for a 
person’s ability is the natural log of the proportion of items that the examinee answered 
correctly divided by the proportion answered incorrectly. These conversions from 
proportions to logits can be seen in greater detail in Tables 2 and 3, respectively. 
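
The two conversions can be sketched as follows. The counts used in the example are the ones quoted in the text (34 examinees and 14 items remaining after editing, four examinees missing item 6, three missing item 7, and person scores of 2 and 11); the function names are assumptions made here, and the results are the initial logits before any centering or correction.

```python
import math

def item_difficulty_logit(n_correct, n_persons):
    """ln(proportion incorrect / proportion correct) for one item."""
    return math.log((n_persons - n_correct) / n_correct)

def person_ability_logit(n_correct, n_items):
    """ln(proportion correct / proportion incorrect) for one person."""
    return math.log(n_correct / (n_items - n_correct))

# Items 6 and 7: answered correctly by 30 and 31 of the 34 examinees.
item_6 = item_difficulty_logit(30, 34)   # about -2.01
item_7 = item_difficulty_logit(31, 34)   # about -2.34

# Lowest and highest person scores: 2 and 11 correct out of 14 items.
lowest = person_ability_logit(2, 14)     # about -1.79
highest = person_ability_logit(11, 14)   # about  1.30

print(round(item_6, 2), round(item_7, 2), round(lowest, 2), round(highest, 2))
```

Note the opposite orientation of the two ratios: an easy item receives a negative difficulty logit, while a high-scoring person receives a positive ability logit.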

Insert Tables 2 & 3 about here 

Once the logit values for both the persons and items are calculated, we have 
overcome another weakness associated with CTT. Namely, while item difficulty and 
person ability levels can in principle range from negative infinity to positive infinity, the 
proportions correct and incorrect are bounded by the values of zero and one. Conversion to 
logits transforms the values onto a scale running from −∞ to +∞. One further step involves 
calculating the mean and standard deviation of the data and converting the logits to a 
standard (z) scale, arbitrarily assigning a center point value of zero. While the scale 
theoretically runs from −∞ to +∞, values realistically tend to vary between −3 and +3 
logits. Table 4 highlights the relationship between proportion (of correct responses) and 
person ability logits for this data set. Additionally, Figure 1 graphically portrays the 
relationship between proportions and logits. 
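
A conversion chart of this kind can be generated with a few lines. The sketch below is not a reproduction of Table 4; the proportions 0.1 through 0.9 and the three logits passed to `center` are illustrative values chosen here.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1 - p))

# Proportion-to-logit chart for proportions 0.1, 0.2, ..., 0.9.
chart = {k / 10: round(logit(k / 10), 2) for k in range(1, 10)}

def center(logits):
    """Shift a set of logits so that their mean is zero."""
    mean = sum(logits) / len(logits)
    return [x - mean for x in logits]

centered = center([-1.0, 0.0, 2.5])  # hypothetical initial logits

for proportion, value in chart.items():
    print(proportion, value)
print(centered)
```

A proportion of 0.5 maps to a logit of zero, and proportions that are equally far from 0.5 map to logits of equal size and opposite sign, which is the S-shaped relationship portrayed in a plot of proportions against logits.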

Insert Table 4 and Figure 1 about here 

Two final calibration steps remain. The initial measurement of item difficulty 
must be corrected for the difficulty dispersion of the items. Additionally, the initial 
measurement of person ability needs to be corrected for the ability dispersion of persons. 
Calculations are modeled in Tables 5 and 6 and result in item calculations that are 
corrected for sample spread and person calculations that are corrected for test width. This 
step is known as the calculation of expansion factors and is crucial given the premise of 
one-parameter IRT that the achievement of any person on a given item is solely 
dependent upon that person’s ability and the difficulty of the specific item. 

Footnote 2: $\ln[p_i/(1 - p_i)]$ 

Insert Tables 5 & 6 about here 
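
The text does not reproduce the expansion factor formulas themselves. The sketch below assumes the normal approximation (PROX) procedure described by Wright and Stone (1979), in which U is the variance of the centered item logits and V is the variance of the person logits; the constants 2.89 and 8.35 belong to that procedure and are an assumption here rather than something stated in this paper. The example variances are hypothetical.

```python
import math

def sample_variance(values):
    """Variance with an n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def expansion_factors(item_logit_var, person_logit_var):
    """Return (item_factor, person_factor) under the assumed PROX formulas.

    item_factor corrects item difficulties for sample spread;
    person_factor corrects person abilities for test width.
    """
    u, v = item_logit_var, person_logit_var
    denominator = 1 - (u * v) / 8.35
    item_factor = math.sqrt((1 + v / 2.89) / denominator)
    person_factor = math.sqrt((1 + u / 2.89) / denominator)
    return item_factor, person_factor

# Hypothetical variances of the initial item and person logits.
item_factor, person_factor = expansion_factors(0.5, 1.0)

# Each initial logit is then multiplied by its factor, for example:
corrected_difficulty = item_factor * -2.0
corrected_ability = person_factor * 1.3

print(item_factor, person_factor)
```

Under these formulas both factors are at least 1, so the correction stretches the initial logits away from zero; with no spread in either set of logits the factors reduce to 1 and nothing changes.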

The final step in the modeling process is to fit the model, obtained via the 
preceding steps, to the data and evaluate the goodness of fit. One cannot merely assume 
that the preceding steps are sufficient in developing an effective model. If we re-examine 
Table 1, a pattern in responses should emerge. Since items are ordered by increasing level 
of difficulty and examinees are progressively ranked according to correct responses, we 
would intuitively expect to see more incorrect responses to the top and right of the table 
and vice versa. In other words, we would expect a person who is higher on a latent trait 
(θ, theta) to have a greater chance of correctly answering a difficult question than a person who is 
lower on that latent trait. Table 7 accentuates the different ‘expectations’. 

While those persons or items that do not fit well with the model are statistically 
identified by the software program used to calculate the Rasch model, some of the 
potential misfits are circled here in order to visually highlight points where the model 
does not perfectly fit the data. It should be noted that both items and persons can be 
identified as aberrant. For example, person 13 answered item 12 correctly when the 
expectation (given other responses) would be that the item would be answered 
incorrectly. Also, items 6 and 8 were answered incorrectly by person 12 when the 
expectation would be a correct response. 

Insert Table 7 about here 

Once an IRT computer program identifies misfits between the model and the data, 
the source of the variance (item- or person-based) is explored. In examining Table 7, 
persons 13 and 29 appear to post responses that are aberrant to expectations. Likewise, 
items 6, 7, 8 and 12 do not fit with expectations. Each of these variants in the table is 
circled for easier identification. As mentioned previously, the source of the 
inconsistencies can originate at either the item or person level. Table 8 simulates how the 
software program would investigate the irregularities caused by persons 29 and 13, for 
example. 

Each item has a calculated difficulty level (d) and each person has a calculated 
ability level (theta). The first step in the analysis of fit is to determine the difference 
between the ability level and the difficulty level for each person and each item. Table 8 
highlights the values involved for persons 29 and 13. When the difference in the two 
values is a positive number, it is an indication that that particular item should be ‘easy’ 
for that particular examinee and should be answered correctly. The higher the number, the 
greater the likelihood of a correct response. Conversely, the more negative the difference, 
the greater the likelihood that the item difficulty exceeds the person’s ability. 
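
The link between the ability-minus-difficulty difference and the chance of a correct response can be made concrete. The sketch below assumes the one-parameter logistic form; the differences used are illustrative values, not entries from Table 8.

```python
import math

def p_correct(theta, d):
    """Probability of a correct response given ability theta and difficulty d."""
    return 1.0 / (1.0 + math.exp(-(theta - d)))

# Probability of success for several ability-minus-difficulty differences.
differences = [-2.0, -1.0, 0.0, 1.0, 2.0]
probabilities = [round(p_correct(diff, 0.0), 2) for diff in differences]

print(probabilities)  # [0.12, 0.27, 0.5, 0.73, 0.88]
```

A difference of zero gives an even chance, a difference of +1 logit gives roughly a 73% chance of success, and a difference of -1 logit gives roughly a 27% chance.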

Insert Table 8 about here 

Looking at Table 8, it appears that person 29 missed item 7, when they should 
have theoretically answered it correctly, while correctly answering item 14, when the 
probability was that it would be missed by a person with a theta equal to zero. Person 13 
missed both items 6 and 7 while getting item 12 correct, all opposite of expectations. 
Remember, however, that the source of variance can result from item irregularities as 
well. Table 9 facilitates our understanding of how item response patterns are examined 
for misfitting results. The process is similar to the aforementioned one. This table 
illustrates the examination of items 6 and 7 for all persons. Again, the process occurs for 
all items and all persons. 



Both items 6 and 7 have fairly high negative logit values for item difficulty levels 
indicating that they should be answered correctly by most examinees. Indeed, an 
examination of the results in Table 9 shows that only three of the 34 examinees missed 
item 7 while only four missed item 6. It appears that it is not items 6 and 7 that are 
causing the irregularity between the model and the data but rather examinees 13 and 29. 
This is exactly what is occurring. The removal of these two persons from the data set 
eliminates most of the irregularity associated with the two items. 
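
The text does not say which fit statistic the software computes. As one possible illustration, the sketch below uses an unweighted mean of squared standardized residuals, a statistic commonly used with the Rasch model; that choice, the function names and the two response patterns are all assumptions made here.

```python
import math

def p_correct(theta, d):
    """One-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - d)))

def standardized_residual(x, theta, d):
    """(observed - expected) / standard deviation for one response x (0 or 1)."""
    p = p_correct(theta, d)
    return (x - p) / math.sqrt(p * (1 - p))

def person_mean_square(responses, theta, difficulties):
    """Unweighted mean of squared standardized residuals for one person."""
    squares = [standardized_residual(x, theta, d) ** 2
               for x, d in zip(responses, difficulties)]
    return sum(squares) / len(squares)

# Hypothetical five-item test, easiest item first, and a person with theta = 0.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
expected_pattern = [1, 1, 1, 0, 0]   # succeeds on easy items, fails on hard ones
aberrant_pattern = [0, 0, 1, 1, 1]   # fails easy items, succeeds on hard ones

fit_expected = person_mean_square(expected_pattern, 0.0, difficulties)
fit_aberrant = person_mean_square(aberrant_pattern, 0.0, difficulties)

print(round(fit_expected, 2), round(fit_aberrant, 2))
```

Values near or below 1 indicate responses in line with the model, while the aberrant pattern yields a much larger value. The same calculation can be run down a column instead of across a row to examine an item rather than a person.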

Upon removal of persons 13 and 29, the process iterates and a new evaluation of 
fit is calculated for the remaining distributions. Again, all combinations of persons and 
items are examined. At the point at which no further removal of either items or persons 
enhances the goodness of fit, the model is said to “fit” the data and the result is items that 
are theoretically both unidimensional and independent. By eliminating both items and 
individuals that deviate from expectations, we can develop a test bank of items that 
should optimally fit the individual person ability levels for most test takers. 



Insert Table 9 about here 

Figure 2 illustrates one-parameter item characteristic curves (ICCs) for four 
hypothetical items. Latent ability (θ) is represented, in logits, along the x-axis. The 
probability of a correct response is located on the y-axis. Since in the one-parameter IRT 
model no factors other than ability (e.g., guessing) are assumed to impact responses, the 
curves are asymptotic to the zero and one points of the probability scale. The 
difficulty level of each item is defined as the logit point at which the probability of 
answering the item correctly is 50% (p = 0.50). Therefore, those items with curves that 
are toward the right side of the x-axis are more difficult than those to the left. For example, 
the item difficulty for item 3 is -1.0 while the item difficulty for item 2 is approximately 
+2.0. Therefore, persons with an ability (θ) equal to zero would probably answer item 3 
correctly and miss items 1 and 2. There is a 50% chance of the person answering item 4 
correctly. 
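
The reading of the curves can be checked with a short sketch. It assumes the one-parameter logistic form and uses the difficulties quoted above for items 2 and 3; the value of 0.0 for item 4 is inferred from the stated 50% chance at an ability of zero, and item 1 is left out because its difficulty is not given in the text.

```python
import math

def rasch_icc(theta, d):
    """One-parameter ICC: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - d)))

# Difficulties in logits (item 2 is approximate; item 4 is inferred).
difficulties = {"item 2": 2.0, "item 3": -1.0, "item 4": 0.0}

# Probability of success on each item for a person with ability zero.
at_zero = {name: round(rasch_icc(0.0, d), 2) for name, d in difficulties.items()}

# Points along the curve for item 3, from -3 to +3 logits of ability.
curve_item_3 = [rasch_icc(theta, -1.0) for theta in range(-3, 4)]

print(at_zero)  # {'item 2': 0.12, 'item 3': 0.73, 'item 4': 0.5}
```

The curve for item 3 rises steadily from near zero toward one, and shifting the difficulty moves the whole curve left or right without changing its shape.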

Insert Figure 2 about here 

Figures 3 and 4 are simply added as a point of comparison and for further 
edification. The two-parameter model assumes two parameters are affecting examinee 
responses: ability and item discrimination. Curve endpoints are still asymptotic as 
answers can only be correct or incorrect. With the two-parameter curve, the slope of the 
curve indicates how well the item differentiates between persons with varying latent 
abilities. For instance, item 2 in Figure 3 has a much flatter slope than that of item 4. 
Therefore, item 4 is a better discriminating item. 
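
For comparison, the two-parameter curve is commonly written as below. The symbol a_i for the discrimination of item i is notation assumed here, d_i is the item difficulty as before, and any scaling constant is again omitted.

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - d_i)}}
```

A larger a_i gives a steeper curve around the item's difficulty, and setting a_i = 1 for every item returns the one-parameter form.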

Figure 4, the three-parameter model, adds a third variable to the equation: the 
effect of guessing. Here, the lower end of the curve may begin at a value other than zero as the 
impact of correctly guessing an item is taken into account. The evaluation guidelines that 
applied to the other two ICCs apply here as well; however, the location of the initial 
endpoint gives the researcher an indication as to how effective item distracters may be. 
For example, items 3 and 6 appear to be potentially guessed correctly whereas items 2 
and 4 do not. 
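
The effect of the guessing parameter on the lower end of the curve can be sketched as follows. The function assumes the usual logistic form of the three-parameter model; the parameter names `a` (discrimination), `d` (difficulty) and `c` (lower asymptote) and the example values are assumptions made here, not values read from Figure 4.

```python
import math

def icc_3pl(theta, a, d, c):
    """Three-parameter logistic ICC with lower asymptote c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - d)))

# Hypothetical item on which a low-ability examinee can guess correctly
# about one time in four.
guessable = dict(a=1.2, d=0.0, c=0.25)

very_low = icc_3pl(-10.0, **guessable)  # approaches the lower asymptote c
at_difficulty = icc_3pl(0.0, **guessable)  # c + (1 - c) / 2 = 0.625

# With a = 1 and c = 0 the function reduces to the one-parameter curve.
rasch_equivalent = icc_3pl(0.0, a=1.0, d=0.0, c=0.0)

print(round(very_low, 3), at_difficulty, rasch_equivalent)
```

An item whose curve levels off well above zero at the low end of the ability scale is one that can be answered correctly by guessing, which is how the starting point of the curve reflects the effectiveness of the distracters.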

Insert Figures 3 and 4 about here 



SUMMARY 

Item response theory models arose from the inherent limitations of classical test 
theory methods of test analysis. Chief among the limitations is that examinee 
characteristics cannot be separated from test characteristics. Item response theory 
overcomes these limitations and rests on two major assumptions: (a) the performance of 
an examinee can be explained by a set of factors known as traits, and (b) the relationship 
between an individual’s item performance and those traits can be described by a 
monotonically increasing function termed an item characteristic curve. 

Item response theory allows the researcher to develop test questions that are 
theoretically both person-free and item-free. IRT stresses maximizing the test 
information function over the range of abilities that are of interest instead of maximizing 
reliability, as does classical psychometrics. While the usefulness of IRT continues to be 
debated, IRT appears to hold many benefits. Among these are a more accurate ability to 
detect item or test bias, the ability to administer customized, individualized, 
computer-adaptive tests and the ability to construct more effective tests, in general. It is 
hoped that this paper has facilitated a better understanding of both the mechanics and the 
rationale behind item response theory (IRT) measurement. 
Logit Conversion Chart 

Response Patterns of Persons to Items 

Table 9 

Fit Analysis for Items 6 and 7 

Figure 1 

Scatterplot of Proportions to Logits 

Figure 2 

One-Parameter Model 

Figure 3 

Two-Parameter Model 

Figure 4 

Three-Parameter Model 